Import Data

Analysis of the dataset

1. Preprocessing

The first step in the analysis is the pre-processing stage: the series of operations applied to the text so that it is easier to analyze. The pre-processing stage consists of the following steps:

wordlist = data['Text'].apply(lambda text: fn.pre_processing_data(text))

data['text_words'] = wordlist

data.to_csv('clean_dataset.csv')
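For reference, a minimal sketch of what fn.pre_processing_data might do, assuming the usual steps of lowercasing, punctuation removal, stop-word removal and stemming with NLTK (the real implementation lives in our functions module fn):

import re
from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def pre_processing_data(text):
    text = re.sub(r'[^a-z\s]', ' ', str(text).lower())   # keep only letters
    tokens = text.split()                                 # simple whitespace tokenization
    return [stemmer.stem(w) for w in tokens if w not in stop_words]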

2. Create a vocabulary with all the unique words from text_words

In this step we are going to take into account the last column of the dataset, which contains the pre-processed words of each review. From this list of words we are going to create a useful dictionary that maps each unique term to an integer.

dictionary = fn.build_dictionary(data) - to build

save_dict(dictionary,'dictionary') - to save
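A possible sketch of fn.build_dictionary and of the save_dict helper, assuming the dictionary simply enumerates the unique words and save_dict pickles the object to disk (the '.p' file name is an assumption):

import pickle

def build_dictionary(data):
    # enumerate every unique word appearing in the pre-processed reviews
    dictionary = {}
    for words in data['text_words']:
        for word in words:
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

def save_dict(obj, name):
    with open(name + '.p', 'wb') as f:
        pickle.dump(obj, f)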

3. For every unique product ID combine the list of words from different reviews together

We noticed in our first analysis that the number of unique products is much smaller than the number of rows in our dataframe. We therefore concluded that there are different reviews, written by different users, that refer to the same exact product. We decided to group all these reviews together by creating a dictionary, saved and later reloaded as reviews_per_product, that has the product IDs as keys and the combined lists of review words as values.

As a first step, we create two lists: one containing the product ID of each row of the original dataset, and one, more importantly, containing only the IDs of the unique products.

new_data = fn.union_of_reviews_for_same_product(unique_products, data) - to build

save_dict(new_data,'new_data') - to save
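A sketch of how these two lists and fn.union_of_reviews_for_same_product could be written, assuming the product id column of the dataset is called ProductId:

products = list(data['ProductId'])                   # product id of every row
unique_products = list(data['ProductId'].unique())   # ids of the unique products

def union_of_reviews_for_same_product(unique_products, data):
    reviews_per_product = {pid: [] for pid in unique_products}
    for pid, words in zip(data['ProductId'], data['text_words']):
        reviews_per_product[pid].extend(words)   # concatenate the words of all reviews of this product
    return reviews_per_product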

Continuing our analysis of the dataset, which aims to prepare a final output convenient for the implementation of the K-Means algorithm, we want to calculate the frequency of each word for each product ID.

We will then use it to obtain the absolute frequency of each unique word, which we will save in a dictionary called frequencies.

freq = fn.frequency(dictionary,frequency_of_word)

save_dict(freq,'frequencies') - to save
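A plausible sketch of these two steps, assuming frequency_of_word maps every product id to its per-word counts and fn.frequency sums them into the absolute frequency of each word of the vocabulary:

from collections import Counter

# per-product word counts: for each product id, how many times each word appears in its reviews
frequency_of_word = {pid: Counter(words) for pid, words in new_data.items()}

def frequency(dictionary, frequency_of_word):
    # absolute frequency of every word of the vocabulary over all reviews
    freq = dict.fromkeys(dictionary, 0)
    for counts in frequency_of_word.values():
        for word, c in counts.items():
            freq[word] += c
    return freq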

Now we have all the elements to make some observations: the original vocabulary contains 109234 unique words. For the implementation of the algorithm these are too many, so we want to reduce the number of words to consider by selecting only those that will actually be significant for clustering the products.

In order to do so, we first filter the list of words by removing those with a very low frequency (<20), since such words are too unusual or technical, or may simply be misspelled and therefore meaningless. At the same time, we remove words with a very high frequency (>200000), since these words are too common and will not help in dividing the products into similar groups, which is our final goal.

By doing this we reduce the vocabulary of unique words found in the reviews from len(dictionary) = 109234 to len(useful_words) = 14783.
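The filtering itself can be as simple as a comprehension over the frequencies dictionary (whether the boundaries are included is an implementation detail we assume here):

useful_words = [word for word, f in freq.items() if 20 <= f <= 200000]
# len(dictionary) = 109234  ->  len(useful_words) = 14783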

As we can see above, all the words with very high frequencies are very common words, so they will not be useful in determining the cluster to which a certain review belongs.

4. Matrix product/word of occurrences

Now we want to create a matrix with the following structure: one row for each unique product and one column for each useful word. Each cell of the matrix represents the frequency of that word inside the reviews of that product.
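A sketch of how such a product/word occurrence matrix could be filled, reusing frequency_of_word and useful_words from the previous steps and a sparse matrix to keep memory usage low:

from scipy.sparse import lil_matrix

word_index = {word: j for j, word in enumerate(useful_words)}
occurrences = lil_matrix((len(unique_products), len(word_index)))

for i, pid in enumerate(unique_products):
    for word, c in frequency_of_word[pid].items():
        if word in word_index:
            occurrences[i, word_index[word]] = c   # frequency of the word in the reviews of this product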

5. Singular Value Decomposition

In order to further reduce the dimensionality of our parameters (words) we want to apply the SVD method. This method essentially performs a principal-component-style decomposition that is better suited to sparse matrices, such as the one we have. When calling this function we need to specify the number of components that will explain a certain amount of the variability of our dataset. In general we should select n_components so that the explained variability stays above 60% of the total. We tried different values and found that 70 was a good balance between dimensionality reduction and percentage of explained variability.

Here we see how the output of this analysis is structured: it is an array of length n_components, and for each element we have the coefficients associated with the 14783 original variables. Each dimension is therefore a linear combination of the input features. We decided to filter variables by ignoring those that did not contribute at least 0.01 to any dimension.
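A sketch of this step with scikit-learn's TruncatedSVD, which works directly on sparse matrices; taking the absolute value of the coefficients before applying the 0.01 threshold is an assumption on our side:

import numpy as np
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=70, random_state=0)
svd.fit(occurrences)
print(svd.explained_variance_ratio_.sum())           # should stay above ~60%

# keep only the words that contribute at least 0.01 to at least one dimension
mask = (np.abs(svd.components_) >= 0.01).any(axis=0)
relevant_words = [w for w, keep in zip(useful_words, mask) if keep]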

After selecting the most relevant words out of the total set of unique words, we have to update the dictionary reviews_per_product, which associates to each ProductID the list of pre-processed words found in all the reviews concerning that product. The list should now contain only the words that were found in the reviews AND are inside the list of relevant_words that we extracted.

Of course, we need to recalculate the frequency of those words...

... and create a final dictionary that maps these relevant words to integers.

6. Calculate Tf-Idf Scores for each word inside the reviews of a product ID

tf_score = fn.tf(frequency_of_word,reviews_per_product)

idf_score = fn.idf(relevant_words,frequency_of_word,reviews_per_product)

tf_idf_scores = fn.tf_idf_score(tf_score,idf_score,reviews_per_product)

save_dict(tf_idf_scores,'tf_idf_scores') - to save
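For reference, one standard way these scores could be computed; the exact normalization used inside fn.tf, fn.idf and fn.tf_idf_score may differ, but the idea is that tf weighs a word by how often it appears in the reviews of a product, while idf penalizes words that appear for many different products:

import math

def tf(frequency_of_word, reviews_per_product):
    # relative frequency of each word inside the reviews of each product
    return {pid: {w: c / max(len(reviews_per_product[pid]), 1) for w, c in counts.items()}
            for pid, counts in frequency_of_word.items()}

def idf(relevant_words, frequency_of_word, reviews_per_product):
    # log of (number of products / number of products whose reviews contain the word)
    N = len(reviews_per_product)
    return {w: math.log(N / sum(1 for counts in frequency_of_word.values() if w in counts))
            for w in relevant_words}

def tf_idf_score(tf_score, idf_score, reviews_per_product):
    return {pid: {w: tf_score[pid][w] * idf_score[w] for w in tf_score[pid] if w in idf_score}
            for pid in reviews_per_product}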

7. Creation of the final dataframe that contains the TF-IDF scores and will be used for the implementation of K-Means:

We will first initialize a matrix with as many rows as the number of unique product IDs and as many columns as the number of relevant words that we found.

For each product we will assign a score to each word:

Finally, we can convert the result into a dataframe:
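A sketch of the whole construction (initialization, score assignment and conversion), assuming final_dictionary maps each relevant word to its column index:

import numpy as np
import pandas as pd

m = np.zeros((len(unique_products), len(final_dictionary)))
for i, pid in enumerate(unique_products):
    for word, score in tf_idf_scores[pid].items():
        m[i, final_dictionary[word]] = score      # TF-IDF score of the word for this product

d = pd.DataFrame(m, index=unique_products, columns=list(final_dictionary))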

And we will save it so that, later on, we can import just this final result, since it contains everything we need for the implementation of the K-Means algorithm.

d.to_csv('tf_idf_scores_dataframe.csv') - to save

K-means

Before applying the algorithm, we need to choose K, the number of clusters. This can be determined with the elbow method.

We decided to select the best k using the following condition: we want the minimum number of clusters that explains at least 80% of the decrease in variance along the curve. In other words, the variance obtained with the best k should satisfy $(\textrm{variance with best } k) \leq (\textrm{variance with } k=1) - 0.8\,[(\textrm{variance with } k = 1) - (\textrm{variance with } k = 20)]$
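A sketch of this selection rule; it assumes that the k_means function described in the next step returns, among other things, the final within-cluster variance for a given k:

variances = {}
for k in range(1, 21):
    variances[k] = k_means(d.values, k)['variance']   # hypothetical return format, see the sketch below

threshold = variances[1] - 0.8 * (variances[1] - variances[20])
best_k = min(k for k, v in variances.items() if v <= threshold)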

We first define some helper functions that will then be called inside the main algorithm: these functions determine the initial centroid of each cluster, compute the distance between each point and each centroid (in order to decide the cluster to which that point should be assigned) and, finally, compute the variance at each iteration of the algorithm in order to understand when the best clustering has been reached.

All these functions are then called in a main function k_means that runs the whole algorithm:

We know that when the within-cluster variance more or less stabilizes, the algorithm has found its solution: the data points within each cluster stop changing. To take this into account, we inserted a stopping condition based on the difference between the variance at step i and at step i-1: when it goes below a certain threshold (which we fixed at 1.0), the algorithm stops.
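A condensed sketch of such a k_means function, with random initial centroids, Euclidean assignments and the variance-based stopping rule described above (our actual implementation splits these steps into the helper functions mentioned earlier):

import numpy as np

def k_means(X, k, max_iter=100, tol=1.0):
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    previous_variance = np.inf
    for _ in range(max_iter):
        # assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # total within-cluster variance at this iteration
        variance = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
        if previous_variance - variance < tol:       # variance has stabilized: stop
            break
        previous_variance = variance
        # update each centroid as the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
                              for j in range(k)])
    return {'labels': labels, 'centroids': centroids, 'variance': variance}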

We can visualize the change of variance at each iteration in the graph below:

We will now create a new dataframe that has only three columns:

  1. the Product ID
  2. the list of words associated to the reviews of that specific product
  3. the cluster to which that product has been assigned by the K-Means algorithm

This new dataframe will be used in the analysis of the clusters, which is the next step of our project

Analysis of the clusters

Since we can retrieve the productID of the products inside the same cluster, we can now analyze the clusters a little bit deeper.

To do so we will need the new dataframe we created, df, and the original dataset Reviews, which contains more information for each row.

1. Identify the kind of products in the cluster (e.g., chips, tea, coffee) using a word cloud.
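A minimal sketch of how the per-cluster word clouds can be generated with the wordcloud package; the column names words and cluster of our new dataframe df are assumptions:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for c in sorted(df['cluster'].unique()):
    # merge the word lists of every product assigned to this cluster into one string
    text = ' '.join(' '.join(words) for words in df.loc[df['cluster'] == c, 'words'])
    wc = WordCloud(background_color='white').generate(text)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Cluster {}'.format(c))
plt.show()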

From what we can see in the word clouds below we can infer the following categories for each cluster:

  1. Chocolate and sauces
  2. Tea and drinks
  3. Dog food
  4. Spices
  5. Cat food
  6. Rice and cereals
  7. none
  8. Nuts
  9. Snacks
  10. Coffee

Most of the clusters contain words that relate to each other and suggest a specific category, while cluster 7 has many words that do not point to any category of food. Those products were probably clustered around a centroid whose reviews related to the quality of the service or the level of satisfaction with a product rather than to the product itself. In fact, we know that many reviews give no specification of the product, and for those it is really hard to perform any clustering. To support this idea we can analyze how many products fall in this category and see whether there is any difference with the other clusters.

2. Provide the number of products in each cluster:

To answer this question we can group the products by the cluster to which they belong.
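With pandas this is essentially a one-liner (column names as before are assumptions):

products_per_cluster = df.groupby('cluster')['ProductId'].count().sort_values(ascending=False)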

As we can see, the cluster with the largest number of associated products is cluster number 7, which corresponds to the following product IDs:

As we predicted in point 1, cluster 7 is peculiar, since none of its words actually suggests a product category; on the contrary, all its words are quite general and probably appear in the reviews of many different product categories. As a result, many more points are associated with this cluster, since the words being considered for clustering are not specific enough.

3. Compute the reviews' score distribution in each cluster

As we can see, although the number of products in each cluster is extremely different, the mean of the scores that users give to the products in each cluster is basically the same, and it is a very high value, always greater than 4.

4. Get the number of unique users writing reviews in each cluster
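Both summaries (points 3 and 4) reduce to a pandas groupby after joining the cluster labels back onto the original dataset, here loaded as a dataframe called Reviews; Score and UserId are columns of the original dataset, while ProductId and cluster come from our df (a sketch):

merged = Reviews.merge(df[['ProductId', 'cluster']], on='ProductId')
score_distribution = merged.groupby('cluster')['Score'].describe()   # per-cluster score statistics
users_per_cluster = merged.groupby('cluster')['UserId'].nunique()    # unique reviewers per cluster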

Plotting results

PCA in 2 dimensions

Here we want to represent the distances among all the points by taking the first two components from PCA.
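A sketch of this 2D plot, assuming d is the tf-idf dataframe and labels are the cluster assignments returned by our k_means:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

coords = PCA(n_components=2).fit_transform(d.values)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()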

In this plot we used the first two components to describe our dataframe and plotted each cluster's cloud of points according to the 2D coordinates produced by the PCA. From here we can see how some clusters, for example clusters 7, 3 and 1, have few or no outliers. This may suggest that the clustering of these products strongly depends on a few variables (maybe two words).

PCA in 3 dimensions

Here we repeat the same approach as before but now using three components to explain our data.

In this plot we have used the first three principal components of our tf-idf dataframe. We assign a color to each point depending on the cluster to which it was assigned by our k-means algorithm. With 3 components we can already see that the coloring is strongly coherent with the point coordinates in most cases. Cluster 10 and cluster 2 are those whose variance is most poorly explained by three components. This result is actually consistent with the word cloud plot, since for those two clusters a lot of words had an average size, meaning that more variables were contributing to defining the cluster.

Comparison with the built-in k-means++

Here we compare the built-in K-means++ function with the one we implemented, considering the number of clusters and the inertia of the two clusterings.
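The built-in side of the comparison is a direct call to scikit-learn; n_clusters=8 reflects the number of clusters discussed below, while the labels and variance of our own implementation come from the k_means sketch above:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=8, init='k-means++', n_init=10, random_state=0).fit(d.values)
print(km.inertia_)                               # WSS of the built-in k-means++
print(k_means(d.values, best_k)['variance'])     # WSS of our implementation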

The first important observation is that, according to the way we decided to pick the optimal number of clusters, we should have 10 clusters, while kmeans++ builds only 8. Of course this difference is also reflected in the WSS score, which is higher for kmeans++ than for our implementation. Another factor that could be responsible for this result is that in our algorithm we used the distance between one point and the centroids as the only parameter defining the clustering, so we are sure that we are actually minimizing the WSS score. On the other hand, the built-in k-means also considers the mean distance of a point to all the other points in the cluster, which certainly makes the clustering less sensitive to an inefficient centroid selection but may increase the overall WSS score.

By comparing this plot with the one obtained from our clustering we can see that, of course, some components are associated with the same cluster, since we only have 8; the clusters that are merged into a single one are, in this case, clusters 8-9 (new cluster 6) and clusters 5-6 (new cluster 3).

Clustering on distance matrix

Our first idea for the clustering was to build a distance matrix of all products, but this approach was not convenient given the large number of elements we would have to store. However, we report the approach we could use if we had a dataset with fewer elements. This approach would save time when running the k-means, since we only need to look up the distance between any two points rather than computing it each time. It also allows clustering not only according to the distance from a point to the centroid but also according to the mean distance of a specific point to every other point in that cluster. To explore this approach further we implemented the elbow method and the silhouette method for optimal-k estimation, and we also wanted to see how random initialization influences the clustering itself. For this purpose we take into account only the first 5000 products and compute their distance matrix.

matrix = functions.build_distance_matrix(final_dictionary) - to build

matrix_df = pd.DataFrame(matrix) - to convert to a dataframe

matrix_df.to_csv('matrix.csv', index=False, header=False) - to save

K-means

A K-means algorithm initialized with random points and a number of clusters k. It returns a dictionary whose keys are the cluster ids, together with the ids of the random centroids, the cluster id associated with each centroid and the within-cluster squared score.

Evaluating optimal number of clusters:

Elbow method:

By using fewer products (5000), we can see that the clustering is harder to achieve with few centroids. The silhouette score keeps increasing with increasing values of k, meaning that a good clustering is not obtained with fewer than 24 clusters. The fact that the clustering is not optimal on few products is also suggested by the fact that none of the cluster sizes in any of the iterations is significantly larger than the others, which in reality is usually not the case.
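For completeness, the silhouette part of this evaluation can be computed directly on the precomputed distance matrix with scikit-learn; labels_k denotes the (hypothetical) assignment returned by our k-means for a given k:

from sklearn.metrics import silhouette_score

score = silhouette_score(matrix, labels_k, metric='precomputed')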